Auto-Explore the Web – Web Crawler

Authors

  • Soumick Chatterjee
  • Asoke Nath
Abstract

The World Wide Web is an ever-growing public library with hundreds of millions of books and no central management system. Finding a piece of information without a proper directory is like finding a needle in a haystack. Search engines address this problem by indexing a portion of the content available on the internet. To accomplish this, they use an automated program known as a web crawler. The most vital task on the web is information retrieval, and it must be performed efficiently. A web crawler supports this by feeding search indexes or by building archives: it automatically visits all available links, which are then indexed. But the use of web crawlers is not limited to search engines; they can also be used for web scraping, spam filtering, identifying unauthorized use of copyrighted content, detecting illegal and harmful web activities, and so on. Web crawlers face various challenges when crawling deep web content, multimedia content, etc. Various crawling techniques and various web crawlers are presented and discussed in this paper.
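
The crawl loop the abstract describes (visit a page, extract its links, enqueue unseen ones) can be made concrete with a minimal sketch. The seed URL, the plain breadth-first strategy, and the regex-based link extraction below are illustrative assumptions, not the paper's specific method:

```python
# Minimal crawl loop: fetch a page, extract links, enqueue unseen ones.
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(seed, max_pages=50):
    seen, frontier = {seed}, deque([seed])
    while frontier and len(seen) <= max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except OSError:
            continue  # skip unreachable pages
        # A real crawler would hand the page to an indexer here (omitted).
        for href in re.findall(r'href=["\'](.*?)["\']', html):
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return seen
```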

Similar articles

Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, downloading only domain-specific web pages is not a simple task, and an unfocused approach often yields undesired results. Therefore, several new ideas have been proposed; among them, a key technique is focused crawling, which is able to crawl particular topical...
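
A hedged sketch of what prioritizing the URL frontier can look like in a focused crawler: URLs whose anchor or context text appears more relevant to the target topic are popped first. The keyword-overlap score below is an illustrative stand-in for whatever relevance model a real focused crawler would use:

```python
# Priority frontier for a focused crawler (illustrative relevance score).
import heapq

TOPIC_TERMS = {"crawler", "indexing", "search"}  # assumed target topic

def relevance(context_text):
    words = set(context_text.lower().split())
    return len(words & TOPIC_TERMS)  # crude topical overlap

frontier = []  # min-heap; scores negated so the best URL is popped first

def push(url, context_text):
    heapq.heappush(frontier, (-relevance(context_text), url))

def pop():
    _, url = heapq.heappop(frontier)
    return url
```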

Analyzing new features of infected web content in detection of malicious web pages

Recent improvements in web standards and technologies enable attackers to hide and obfuscate infectious code with new methods and thus escape security filters. In this paper, we study the application of machine learning techniques to detecting malicious web pages. In order to detect malicious web pages, we propose and analyze a novel set of features including HTML, JavaScript (jQuery...
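
To make the feature-based approach concrete, here is an illustrative sketch of the kind of HTML/JavaScript features such a classifier might consume; the specific features below are assumptions, not the feature set proposed in that paper:

```python
# Illustrative page features for malicious-page classification.
import re

def extract_features(html):
    lower = html.lower()
    return {
        "num_scripts": lower.count("<script"),        # script tag count
        "num_iframes": lower.count("<iframe"),        # hidden iframes are a common vector
        "num_eval_calls": len(re.findall(r"\beval\s*\(", html)),  # obfuscation hint
        "num_escaped_chars": len(re.findall(r"%[0-9a-fA-F]{2}", html)),
        "page_length": len(html),
    }
```

Feature vectors like these would then be fed to a standard machine-learning classifier.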

Online Social Honeynets: Trapping Web Crawlers in OSN

Web crawlers are complex applications that explore the Web for different purposes. Web crawlers can be configured to crawl online social networks (OSNs) to obtain relevant data about their global structure. Before a web crawler can be launched to explore the web, a large number of settings have to be configured. These settings define the behavior of the crawler and have a big impact on the collect...
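
A minimal sketch of the sort of settings the abstract refers to, gathered into a single configuration object; the field names and defaults are purely illustrative assumptions:

```python
# Illustrative crawler configuration; every field name is an assumption.
from dataclasses import dataclass, field

@dataclass
class CrawlerConfig:
    seeds: list = field(default_factory=list)   # starting URLs
    max_depth: int = 3                          # how far to follow links from a seed
    delay_seconds: float = 1.0                  # politeness delay between requests
    obey_robots_txt: bool = True                # honor site crawling rules
    user_agent: str = "research-crawler/0.1"    # how the crawler identifies itself
    allowed_domains: list = field(default_factory=list)  # crawl scope
```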

Crawling the Web: Discovery and Maintenance of Large-scale Web Data

This dissertation studies the challenges and issues faced in implementing an effective Web crawler. A crawler is a program that retrieves and stores pages from the Web, commonly for a Web search engine. A crawler often has to download hundreds of millions of pages in a short period of time and has to constantly monitor and refresh the downloaded pages. In addition, the crawler should avoid putt...
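
The truncated final sentence appears to concern limiting the load a crawler places on web servers; assuming that reading, here is a minimal sketch of one standard safeguard, a per-host politeness delay. The delay value and host bookkeeping are illustrative:

```python
# Per-host politeness delay so no single server is hit too frequently.
import time
from urllib.parse import urlparse

last_hit = {}  # host -> timestamp of the most recent request

def polite_wait(url, min_delay=1.0):
    host = urlparse(url).netloc
    elapsed = time.time() - last_hit.get(host, 0.0)
    if elapsed < min_delay:
        time.sleep(min_delay - elapsed)  # wait out the remaining delay
    last_hit[host] = time.time()
```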

jÄk: Using Dynamic Analysis to Crawl and Test Modern Web Applications

Web application scanners are popular tools to perform black box testing and are widely used to discover bugs in websites. For them to work effectively, they either rely on a set of URLs that they can test, or use their own implementation of a crawler that discovers new parts of a web application. Traditional crawlers would extract new URLs by parsing HTML documents and applying static regular e...
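
A sketch of the traditional static extraction step that the abstract contrasts with dynamic analysis: candidate URLs are pulled out of the raw HTML with regular expressions, without executing any JavaScript. The pattern below is an illustrative simplification:

```python
# Static URL extraction: regex over raw HTML, no JavaScript execution.
import re

LINK_RE = re.compile(r'(?:href|src)=["\']([^"\']+)["\']', re.IGNORECASE)

def extract_urls(html):
    return LINK_RE.findall(html)
```

Dynamic approaches such as jÄk instead execute the page's scripts to discover URLs and event handlers that never appear in the static HTML.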

Publication date: 2017